Information Filtering using Index Word Selection based on the Topics
Abstract
We propose an information filtering system that selects index words from a document set based on the topics the set contains. The method narrows the vocabulary down to the words most characteristic of the document set, with the topics obtained by Sparse Non-negative Matrix Factorization (Sparse NMF). In information filtering, a document is often represented as a vector whose elements are the weights of the index words, and the dimension of this vector grows as the number of documents increases. As a result, the index words may include words that are useless for filtering, so the dimension needs to be reduced. Our proposal reduces the dimension by selecting index words based on the topics in the document set, which we obtain by applying Sparse NMF to that set. Filtering is carried out against the centroid of the learning document set, which is regarded as the user’s interest. The centroid is represented as a document vector whose elements are the weights of the selected index words. We evaluate the proposal on the English test collection MEDLINE. When an appropriate number of index words is selected, the proposed selection improves recommendation accuracy over previous methods. We also examined the index words selected by our proposal and found that they cover some minor topics included in the document set. Keywords: Information Filtering, Sparse NMF, Index Word Selection, User Profile, Chi-squared Measure
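The pipeline described in the abstract (topics from Sparse NMF, index word selection, centroid-based filtering) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the multiplicative-update factorization with an L1 penalty, the rule of keeping the top-weighted words per topic, the cosine score, the toy matrix, and all function and parameter names (`sparse_nmf`, `select_index_words`, `centroid_profile`, `score`, `l1`, `per_topic`) are assumptions. The abstract does not specify the actual selection criterion or weighting scheme.

```python
import numpy as np


def sparse_nmf(V, k, l1=0.1, iters=200, seed=0):
    """Approximate V (docs x words, non-negative) as W @ H.

    Assumed variant: multiplicative updates for the squared error,
    with an L1 penalty `l1` on H to encourage sparse topics.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + l1 + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Keep columns of W at unit length so the penalty acts on H.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H


def select_index_words(H, per_topic):
    """Assumed rule: union of the top `per_topic` words of each topic."""
    top = np.argsort(-H, axis=1)[:, :per_topic]
    return np.unique(top)


def centroid_profile(V_train, selected):
    """User profile: centroid of the learning documents over selected words."""
    return V_train[:, selected].mean(axis=0)


def score(doc_vec, profile, selected):
    """Cosine similarity between a document and the profile (assumed measure)."""
    d = doc_vec[selected]
    denom = np.linalg.norm(d) * np.linalg.norm(profile)
    return float(d @ profile / denom) if denom > 0 else 0.0


# Toy learning set: 6 documents x 8 words (hypothetical weights).
V = np.array([
    [3, 2, 1, 0, 0, 0, 1, 0],
    [2, 3, 0, 1, 0, 0, 0, 1],
    [3, 1, 2, 0, 0, 0, 1, 0],
    [0, 0, 0, 3, 2, 1, 0, 1],
    [0, 1, 0, 2, 3, 2, 1, 0],
    [0, 0, 1, 3, 1, 3, 0, 1],
], dtype=float)

W, H = sparse_nmf(V, k=2)
selected = select_index_words(H, per_topic=3)
profile = centroid_profile(V, selected)

# Score an unseen document against the reduced-dimension profile.
new_doc = np.array([2, 2, 1, 0, 0, 0, 0, 0], dtype=float)
print(selected, score(new_doc, profile, selected))
```

A document would be recommended when its score exceeds some threshold; choosing that threshold and the number of selected words is left open here, in line with the abstract's remark that accuracy depends on selecting an appropriate number of index words.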
Similar resources
Analysis of the structure of words and concepts in Knowledge and Information Science articles based on social network analysis in the Web of Science database, in two periods before and after the emergence of the Web (1993-1997 and 2009-2013)
Purpose: This study aimed to identify and analyze the structure of “Knowledge and Information Science (KIS)” scientific articles using co-word analysis in the “Web of Science (WoS)” database (1993-1997 & 2009-2013). Through co-word analysis of the KIS articles, the subjects and concepts of KIS were identified. Methodology: This study is based on a descriptive and functional approach and on co-wor...
A probabilistic topic model based on local word relationships in overlapping windows
A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, extracting the topics from the given documents. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...
A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation
Recommender systems utilize information retrieval and machine learning techniques for filtering information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative filtering based recommender systems. In order to improve accuracy of traditional user based collaborative filtering techniques under new user cold-start problem a...
A hybrid recommender system using trust and biclustering to improve the performance of collaborative filtering
In the present era, the amount of information grows exponentially, so finding the required information among the mass of information has become a major challenge. The success of e-commerce systems and online business transactions depends greatly on the effective design of product recommender mechanisms. Providing high quality recommendations is important for e-commerce systems to assist users i...
Information filtering based on wiki index database
In this paper we present a profile-based approach to information filtering by an analysis of the content of text documents. The Wikipedia index database is created and used to automatically generate the user profile from the user’s document collection. The problem-oriented Wikipedia subcorpora are created (using knowledge extracted from the user profile) for each topic of user interests. The in...